A Study of Web Information Extraction Technology Based on Beautiful Soup
نویسندگان
چکیده
In the context of comparative analysis of common web information retrieval technologies, this article discusses the principles and applications of Beautiful Soup, a vertical information search technology based on DOM tree structure. Supported by actual system examples and centering on the system architecture and core technology, this article discusses how to use Beautiful Soup to conduct deep information retrieval for partially structured webpage data, obtain directional information, reorganize the information, and then send the information to users via text message. The test results demonstrate that the web crawler achieved over 95% accuracy, satisfying the needs for commercial application.
منابع مشابه
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. In our extraction algorithm, we inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
متن کاملEXTRACTION-BASED TEXT SUMMARIZATION USING FUZZY ANALYSIS
Due to the explosive growth of the world-wide web, automatictext summarization has become an essential tool for web users. In this paperwe present a novel approach for creating text summaries. Using fuzzy logicand word-net, our model extracts the most relevant sentences from an originaldocument. The approach utilizes fuzzy measures and inference on theextracted textual information from the docu...
متن کاملThe relationship between social problem solving with acceptance and use of Web-based resources in educational-research activities adopting a Technology Acceptance Model (TAM)
Aim: Problem-solving is one of the most important issues in the field of psychology. It seems that solving social problems as an external variable has an important role in the acceptance of information technology. Therefore, the aim of the present study was to investigate the relationship between social problem solving and the use and acceptance of web-based resources in educational-research ac...
متن کاملIdentification and Classification of Desirable Web-Based Services from the Perspective of Website Users of Iran’s Hospitals Based on Kano Model of Customer Satisfaction
Background and Aim: A hospital website is an appropriate system for exchanging information and connecting patients, hospitals and medical staff. The purpose of this study was to identify and classify desirable web-based services in websites of Iran's hospitals based on Kano’s Customer Satisfaction Model. Materials and Methods: This was a survey study. The statistical population of the study co...
متن کاملPresenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- JCP
دوره 10 شماره
صفحات -
تاریخ انتشار 2015